1) Distribution of ratings in the dataset
2) Top 5 most represented insurers
3) Top 5 least represented insurers: the last 2 or 3 should probably be removed
4) Distribution of products in the dataset: the last 2 or 3 products should probably be removed
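The counts above can be sketched with pandas `value_counts`; the column names `note` (rating) and `assureur` (insurer) are assumptions about the actual dataset, and the toy frame stands in for the real one.

```python
import pandas as pd

# Toy stand-in for the reviews dataframe; 'note' and 'assureur' are
# assumed column names, not confirmed from the notebook.
df = pd.DataFrame({
    "note": [5, 4, 1, 5, 3, 5, 2],
    "assureur": ["ZEN UP", "LCL", "LCL", "ZEN UP", "MAIF", "ZEN UP", "MAIF"],
})

# Distribution of ratings in the dataset
rating_counts = df["note"].value_counts().sort_index()

# Top 5 insurers by review count; the tail of the full ranking is what
# the notes above suggest removing.
top_insurers = df["assureur"].value_counts().head(5)
```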
Preprocessing of the 'avis' column
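A minimal cleaning sketch for the 'avis' (review) text: lowercase, keep only letters, drop stop words. The stop-word list here is a tiny illustrative subset, not the one actually used in the notebook.

```python
import re
import pandas as pd

# Tiny illustrative French stop-word list (assumption, not the real one).
STOPWORDS = {"le", "la", "les", "de", "et", "un", "une", "est"}

def clean_review(text: str) -> list:
    """Lowercase the review, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

df = pd.DataFrame({"avis": ["Le service est EXCELLENT !",
                            "Très déçu, résiliation difficile."]})
df["avis2"] = df["avis"].apply(clean_review)
```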
1) Word clouds of the reviews by rating
2) Evolution of the average rating per insurer
3) Word cloud of the reviews for the insurer 'ZEN UP', which has the best average rating
4) Word cloud of the reviews for the insurer 'LCL', which has the worst average rating
5) Evolution of the average rating per product
6) Word cloud of the reviews for auto products: best average rating
7) Word cloud of the reviews for life products: worst average rating
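The "average rating per insurer / per product" rankings above can be sketched with a pandas `groupby`; the column names `assureur`, `produit`, and `note` are assumptions, and the toy frame stands in for the real data.

```python
import pandas as pd

# Toy data; column names are assumed, not taken from the notebook.
df = pd.DataFrame({
    "assureur": ["ZEN UP", "ZEN UP", "LCL", "LCL"],
    "produit": ["auto", "vie", "auto", "vie"],
    "note": [5, 4, 2, 1],
})

# Average rating per insurer and per product, best first.
mean_by_insurer = df.groupby("assureur")["note"].mean().sort_values(ascending=False)
mean_by_product = df.groupby("produit")["note"].mean().sort_values(ascending=False)
```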
1st try: we count the frequency of every word in the whole dataframe.
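This global count can be sketched with `collections.Counter` over the tokenised reviews (assuming an `avis2` column of token lists):

```python
from collections import Counter
import pandas as pd

# Tokenised reviews; a toy stand-in for the real 'avis2' column.
df = pd.DataFrame({"avis2": [["service", "rapide"],
                             ["service", "lent"],
                             ["rapide"]]})

# Frequency of every word across the whole dataframe.
global_freq = Counter(word for tokens in df["avis2"] for word in tokens)
```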
We now try another, more flexible method: for each review, we count how many times each word is repeated within each rating category.
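One way to sketch the per-category counts, assuming a `note` rating column and a tokenised `avis2` column:

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the real dataframe; column names are assumptions.
df = pd.DataFrame({
    "note": [5, 5, 1],
    "avis2": [["service", "rapide"], ["service"], ["service", "lent"]],
})

# One word-frequency Counter per rating category.
freq_by_rating = {
    note: Counter(w for tokens in group["avis2"] for w in tokens)
    for note, group in df.groupby("note")
}
```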
We now add the sentiment of the words in each review: 'avis3' holds the sentiment of the words from the 'avis2' column, and 'avis4' holds the sentiment of the words from the 'avis' column.
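A lexicon-based scoring sketch: the notebook likely uses a proper French sentiment lexicon; the four-word lexicon below is purely illustrative.

```python
import pandas as pd

# Toy polarity lexicon (assumption; the real one would be far larger).
LEXICON = {"excellent": 1.0, "rapide": 0.5, "lent": -0.5, "arnaque": -1.0}

def review_sentiment(tokens):
    """Average polarity of the lexicon words found in one review."""
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

df = pd.DataFrame({"avis2": [["service", "excellent"], ["lent", "arnaque"]]})
df["avis3"] = df["avis2"].apply(review_sentiment)
```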
We can now build a Word2Vec model on 'avis4' in order to work with an unsupervised model (see the report for an explanation).
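A common way to turn a Word2Vec model into per-review features is to average the word vectors of each review. The tiny hand-made vector table below stands in for vectors that would come from a trained model (e.g. gensim's `Word2Vec`):

```python
import numpy as np

# Stand-in word-vector table; in the notebook these would come from a
# Word2Vec model trained on the 'avis4' column.
VECTORS = {
    "service": np.array([1.0, 0.0]),
    "rapide": np.array([0.0, 1.0]),
    "lent": np.array([0.0, -1.0]),
}

def review_vector(tokens, dim=2):
    """Average the word vectors of a review (zero vector if none known)."""
    vecs = [VECTORS[t] for t in tokens if t in VECTORS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

doc = review_vector(["service", "rapide"])
```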
We reduce the dimensionality with PCA.
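The reduction can be sketched with scikit-learn's `PCA`; the toy matrix stands in for the real review embeddings, and 2 components is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy review embeddings; in the notebook X would be the averaged
# Word2Vec vectors, typically of much higher dimension.
X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.1],
              [0.1, 0.9, 0.0]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```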
1st model: KMeans
We add the sentiment of the whole review as an extra feature and fit the KMeans model again.
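Appending the sentiment score to the reduced embeddings before clustering can be sketched as follows (toy arrays; the cluster count of 2 is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy reduced embeddings plus per-review sentiment appended as a column.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
sentiment = np.array([[0.8], [0.7], [-0.6], [-0.5]])
X = np.hstack([embeddings, sentiment])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```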
2nd model
The features
The target
We split the data into train and test sets to build our models.
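The split can be sketched with scikit-learn's `train_test_split`; the 80/20 ratio and the toy arrays are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and target; in the notebook X would be the review
# features and y the rating ('note') column.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```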
We do not get good results with this model; we keep the XGB regressor to make our predictions on the test set.
We now apply the same preprocessing to the test set.
This result shows that the preprocessing of the test set is identical to, and as good as, that of the train dataset.
We can now predict the ratings with the XGB regressor.
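The fit/predict step can be sketched as below. `GradientBoostingRegressor` from scikit-learn stands in for xgboost's `XGBRegressor` (which exposes the same fit/predict interface) so the sketch runs without xgboost installed; swap in `xgboost.XGBRegressor` to match the notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data standing in for the preprocessed train and test sets.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train[:, 0] * 2 + rng.normal(scale=0.1, size=100)
X_test = rng.normal(size=(20, 3))

# Stand-in for XGBRegressor; same fit/predict usage pattern.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
predicted_ratings = model.predict(X_test)
```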